Predicting Tennis Players' Ranks Based on Physical Attributes¶

If you know a bit about tennis, you've likely heard of Roger Federer. World famous tennis player Roger Federer, who achieved a peak rank of number 1, is a true tennis genius. From our dataset, we know that he is right-handed, and he uses one hand to hit backhand shots. Could Federer’s success be due to which hand or how many hands he uses? Or might it depend on other factors like height and weight? We will examine player statistics to investigate these questions. The dataset we will use is “Player Stats for Top 500 Players” from https://www.ultimatetennisstatistics.com. Using this dataset, we will attempt to build a model to predict a new player’s success in professional tennis.

Methods¶

We first perform some preliminary summarization and analysis of the data.

In [1]:
# loading libraries

library(tidyverse)
library(testthat)
library(digest)
library(repr)
library(tidymodels)
library(GGally)
library(ISLR)

install.packages("plotly")
library(plotly)
library(reshape2)

install.packages("png")
library(png)

options(repr.matrix.max.rows = 7)
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

✔ ggplot2 3.3.6     ✔ purrr   0.3.4
✔ tibble  3.1.7     ✔ dplyr   1.0.9
✔ tidyr   1.2.0     ✔ stringr 1.4.0
✔ readr   2.1.2     ✔ forcats 0.5.1

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()


Attaching package: ‘testthat’


The following object is masked from ‘package:dplyr’:

    matches


The following object is masked from ‘package:purrr’:

    is_null


The following objects are masked from ‘package:readr’:

    edition_get, local_edition


The following object is masked from ‘package:tidyr’:

    matches


── Attaching packages ────────────────────────────────────── tidymodels 1.0.0 ──

✔ broom        1.0.0     ✔ rsample      1.0.0
✔ dials        1.0.0     ✔ tune         1.0.0
✔ infer        1.0.2     ✔ workflows    1.0.0
✔ modeldata    1.0.0     ✔ workflowsets 1.0.0
✔ parsnip      1.0.0     ✔ yardstick    1.0.0
✔ recipes      1.0.1     

── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
✖ scales::discard()   masks purrr::discard()
✖ dplyr::filter()     masks stats::filter()
✖ recipes::fixed()    masks stringr::fixed()
✖ testthat::is_null() masks purrr::is_null()
✖ dplyr::lag()        masks stats::lag()
✖ rsample::matches()  masks testthat::matches(), dplyr::matches(), tidyr::matches()
✖ yardstick::spec()   masks readr::spec()
✖ recipes::step()     masks stats::step()
• Dig deeper into tidy modeling with R at https://www.tmwr.org

Registered S3 method overwritten by 'GGally':
  method from   
  +.gg   ggplot2

Updating HTML index of packages in '.Library'

Making 'packages.html' ...
 done


Attaching package: ‘plotly’


The following object is masked from ‘package:ggplot2’:

    last_plot


The following object is masked from ‘package:stats’:

    filter


The following object is masked from ‘package:graphics’:

    layout



Attaching package: ‘reshape2’


The following object is masked from ‘package:tidyr’:

    smiths


Updating HTML index of packages in '.Library'

Making 'packages.html' ...
 done

First, we read in our data. We select variables of interest, then tidy it, formatting cells to remove parenthetical information and units of measurement.

In [2]:
## reading data (from https://www.ultimatetennisstatistics.com/)

tennis_data <- read_csv("https://drive.google.com/uc?export=download&id=1_MECmUXZuuILYeEOfonSGqodW6qVdhsS")

## organizing/tidying data

colnames(tennis_data) <- make.names(colnames(tennis_data))

tennis <- tennis_data |>
            select(Current.Rank, Age, Height, Weight, Plays, Backhand, Favorite.Surface) |> # select relevant variables
            mutate(across(everything(), function(col) {gsub(" .*", "", col)})) |> # format cells
            mutate(across(Current.Rank:Weight, as.numeric)) |> # convert chr to dbl
            mutate(across(Plays:Favorite.Surface, as.factor)) # convert chr to fct

# tennis <- tennis_data |>
#             select(Current.Rank, Age, Height) |> # select relevant variables
#             mutate(across(everything(), function(col) {gsub(" .*", "", col)})) |> # format cells
#             mutate(across(Current.Rank:Height, as.numeric)) |> # conver chr to dbl
#             drop_na() # remove observations with missing values

paste("Table 1. Tidied dataset.")
tennis
New names:
• `` -> `...1`
Rows: 500 Columns: 38
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (25): Age, Country, Plays, Wikipedia, Current Rank, Best Rank, Name, Bac...
dbl (13): ...1, Turned Pro, Seasons, Titles, Best Season, Retired, Masters, ...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
'Table 1. Tidied dataset.'
A tibble: 500 × 7
Current.RankAgeHeightWeightPlaysBackhandFavorite.Surface
<dbl><dbl><dbl><dbl><fct><fct><fct>
37826 NANARight-handedNA NA
32618 NANALeft-handed Two-handedNA
17832185NARight-handedTwo-handedFast
23621 NANARight-handedTwo-handedNA
⋮⋮⋮⋮⋮⋮⋮
49523NANALeft-handedNA NA
41924NANANA NA NA
45122NANALeft-handedTwo-handedNA

We split our data into a training and testing set so that we may evaluate our model's accuracy later.

In [3]:
set.seed(2022)

tennis_split <- initial_split(tennis, prop = 0.75, strata = Current.Rank)

tennis_training <- training(tennis_split)
tennis_testing <- testing(tennis_split)

To start, we determine the percentage of missing values for each variable, as well as calculate the means for numerical values and modes for factor values.

In [4]:
## exploratory data analysis

# calculating missing percentages

missing_num <- tennis_training |>
                select(where(is.numeric)) |>
                pivot_longer(everything(), names_to = "var", values_to = "val") |>
                mutate(missing = is.na(val)) |>
                group_by(var, missing) |>
                summarise(n = n()) |>
                mutate(percentage_missing = (n / nrow(tennis)) * 100)

missing_fct <- tennis_training |>
                select(where(is.factor)) |>
                pivot_longer(everything(), names_to = "var", values_to = "val") |>
                mutate(missing = is.na(val)) |>
                group_by(var, missing) |>
                summarise(n = n()) |>
                mutate(percentage_missing = n / nrow(tennis) * 100)

missing_all <- bind_rows(missing_num, missing_fct) |>
                filter(missing) |>
                select(var, percentage_missing)

# summary statistics

summary_num <- tennis_training |>
                select(where(is.numeric)) |>
                summarize_all(mean, na.rm = TRUE)

find_mode <- function(v) # calculates mode of given vector
{
    tibble(var = v, y = 0) |>
        na.omit() |>
        group_by(var) |>
        summarize(n = n()) |>
        slice_max(n) |>
        select(var) |>
        pull()
}

summary_fct <- tennis_training |>
                select(where(is.factor)) |>
                summarize_all(find_mode)

summary_all <- bind_cols(summary_num, summary_fct)

paste("Table 2. Missing percentages of variables.")
missing_all # shows percentage of missing values for each variable (some have very high % missing)
paste("Table 3. Summary statistics of variables.")
summary_all # shows summary statistics for each variable; means of numeric, modes of factors
`summarise()` has grouped output by 'var'. You can override using the `.groups`
argument.
`summarise()` has grouped output by 'var'. You can override using the `.groups`
argument.
'Table 2. Missing percentages of variables.'
A grouped_df: 7 × 2
varpercentage_missing
<chr><dbl>
Age 0.2
Current.Rank 0.8
Height 56.8
Weight 71.6
Backhand 14.0
Favorite.Surface38.0
Plays 6.4
'Table 3. Summary statistics of variables.'
A tibble: 1 × 7
Current.RankAgeHeightWeightPlaysBackhandFavorite.Surface
<dbl><dbl><dbl><dbl><fct><fct><fct>
248.268326.05376186.23684.46667Right-handedTwo-handedClay

As the weight variable simply has too many missing values, we will discard it.

Now, we visualize potential correlations between the rest of the variables and current rank.

In [5]:
## data visualization

# defining variables/functions

point_alpha <- 0.6
point_size <- 2.5

theme_layer <- theme(text = element_text(size = 20))

set_size <- function(p)
{
    switch(
        p,
        "age" = options(repr.plot.width = 8, repr.plot.height = 7),
        "height" = options(repr.plot.width = 7, repr.plot.height = 6)
    )
}
In [6]:
# plotting variables of interest against current rank

age_plot <- tennis_training |>
                ggplot(aes(x = Age, y = Current.Rank)) +
                geom_point(alpha = point_alpha, colour = "coral4", size = point_size) +
                labs(x = "Age (years)", y = "Current Rank") +
                ggtitle("Figure 1. Current Rank vs. Age (years)") +
                theme_layer

height_plot <- tennis_training |>
                ggplot(aes(x = Height, y = Current.Rank)) +
                geom_point(alpha = point_alpha, colour = "darkolivegreen", size = point_size) +
                ylim(0, 300) +
                labs(x = "Height (cm)", y = "Current Rank") +
                ggtitle("Figure 2. Current Rank vs. Height (cm)") +
                theme_layer

set_size("age")
age_plot
set_size("height")
height_plot
Warning message:
“Removed 4 rows containing missing values (geom_point).”
Warning message:
“Removed 299 rows containing missing values (geom_point).”

The current rank vs age plot seems to show a very weak negative relationship.
The current rank vs height plot very loosely takes on the shape of a normal distribution.

In [7]:
# colouring by handedness

age_plot_c_p <- tennis_training |>
                ggplot(aes(x = Age, y = Current.Rank, colour = Plays)) +
                geom_point(alpha = point_alpha, size = point_size) +
                labs(x = "Age (years)", y = "Current Rank") +
                ggtitle("Figure 3. Current Rank vs. Age (years)") +
                theme_layer

height_plot_c_p <- tennis_training |>
                ggplot(aes(x = Height, y = Current.Rank, colour = Plays)) +
                geom_point(alpha = point_alpha, size = point_size) +
                ylim(0, 300) +
                labs(x = "Height (cm)", y = "Current Rank") +
                ggtitle("Figure 4. Current Rank vs. Height (cm)") +
                theme_layer

set_size("age")
age_plot_c_p
set_size("height")
height_plot_c_p
Warning message:
“Removed 4 rows containing missing values (geom_point).”
Warning message:
“Removed 299 rows containing missing values (geom_point).”
In [8]:
# colouring by backhand

age_plot_c_bh <- tennis_training |>
                ggplot(aes(x = Age, y = Current.Rank, colour = Backhand)) +
                geom_point(alpha = point_alpha, size = point_size) +
                labs(x = "Age (years)", y = "Current Rank") +
                ggtitle("Figure 5. Current Rank vs. Age (years)") +
                theme_layer

height_plot_c_bh <- tennis_training |>
                ggplot(aes(x = Height, y = Current.Rank, colour = Backhand)) +
                geom_point(alpha = point_alpha, size = point_size) +
                ylim(0, 300) +
                labs(x = "Height (cm)", y = "Current Rank") +
                ggtitle("Figure 6. Current Rank vs. Height (cm)") +
                theme_layer

set_size("age")
age_plot_c_bh
set_size("height")
height_plot_c_bh
Warning message:
“Removed 4 rows containing missing values (geom_point).”
Warning message:
“Removed 299 rows containing missing values (geom_point).”
In [9]:
# colouring by favourite surface

age_plot_c_fs <- tennis_training |>
                ggplot(aes(x = Age, y = Current.Rank, colour = Favorite.Surface)) +
                geom_point(alpha = point_alpha, size = point_size) +
                labs(x = "Age (years)", y = "Current Rank", colour = "Favourite Surface") +
                ggtitle("Figure 7. Current Rank vs. Age (years)") +
                theme_layer

height_plot_c_fs <- tennis_training |>
                ggplot(aes(x = Height, y = Current.Rank, colour = Favorite.Surface)) +
                geom_point(alpha = point_alpha, size = point_size) +
                ylim(0, 300) +
                labs(x = "Height (cm)", y = "Current Rank", colour = "Favourite Surface") +
                ggtitle("Figure 8. Current Rank vs. Height (cm)") +
                theme_layer

set_size("age")
age_plot_c_fs
set_size("height")
height_plot_c_fs
Warning message:
“Removed 4 rows containing missing values (geom_point).”
Warning message:
“Removed 299 rows containing missing values (geom_point).”

Analyzing the coloured plots, there does not seem to be any correlation between current rank and handedness, backhand, or favourite surface, and so we discard those as well.

In [10]:
tennis_training <- select(tennis_training, Current.Rank, Age, Height) |> drop_na()
tennis_testing <- select(tennis_testing, Current.Rank, Age, Height) |> drop_na()
paste("Table 4")
tennis_training
paste("Table 5")
tennis_testing
'Table 4'
A tibble: 89 × 3
Current.RankAgeHeight
<dbl><dbl><dbl>
12131198
4532198
10530180
1232178
⋮⋮⋮
39230183
43028183
45929178
'Table 5'
A tibble: 26 × 3
Current.RankAgeHeight
<dbl><dbl><dbl>
3229175
133185
25532185
18031180
⋮⋮⋮
13829193
10827180
17934183

We will use K-nearest neighbors regression to predict tennis players' ranks. Regression will be used because we will be predicting a numerical variable. This model works best with our data because we plan to use multiple predictors in our analysis, creating a non-linear relationship between our variables.

As part of our preliminary analysis, it was found that age and height were the predictors that showed the most correlation with rank. Other variables we initially intended to use either did not seem to display any significant relationship with rank (handedness, backhand, favourite surface) or did not have enough data points (weight).

In order to achieve the best results, we will create and test several models with different combinations of our variables of interest (age and height), and select the combination which yields the highest accuracy.

First we will use only age.

In [11]:
set.seed(2022)

# Predictors: age
# model, recipe, cross-validation

knn_spec <- nearest_neighbor(weight_func = "rectangular",
                             neighbors = tune()) |>  
            set_engine("kknn") |>  
            set_mode("regression")

tennis_age_recipe <- recipe(Current.Rank ~ Age, data = tennis_training) |>  
            step_scale(all_predictors()) |>  
            step_center(all_predictors())

tennis_age_vfold <- vfold_cv(tennis_training, v = 5, strata = Current.Rank)

tennis_age_wkflw <- workflow() |>  add_recipe(tennis_age_recipe) |>  add_model(knn_spec)

# gridvals <- tibble(neighbors = seq(1, 66))

tennis_age_results <- tennis_age_wkflw |>
             tune_grid(resamples = tennis_age_vfold, grid = 65) |>
             collect_metrics() |>
             filter(.metric == "rmse")

tennis_age_min <- tennis_age_results |> filter(mean == min(mean))
paste("Table 6")
tennis_age_min
'Table 6'
A tibble: 1 × 7
neighbors.metric.estimatormeannstd_err.config
<int><chr><chr><dbl><int><dbl><chr>
13rmsestandard131.565155.682429Preprocessor1_Model13
In [12]:
# Retrain data on training set, then predict with testing

# re-train KNN regression model on the training data set
tennis_age_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 13) |>  
            set_engine("kknn") |>  
            set_mode("regression")

tennis_age_fit <- workflow() |>  
        add_recipe(tennis_age_recipe) |>  
        add_model(tennis_age_spec) |>  
        fit(data = tennis_training)

# predict rmpse with testing
tennis_age_rmspe <- tennis_age_fit |>  
            predict(tennis_testing) |>  
            bind_cols(tennis_testing) |>  
            metrics(truth = Current.Rank, estimate = .pred) |>  
            filter(.metric == 'rmse') |>
            select(.estimate) |>
            pull()

tennis_age_rmspe
94.5684024883759

Next, we use only height.

In [13]:
set.seed(2022)

# Predictors: height
# model, recipe, cross-validation 

tennis_height_recipe <- recipe(Current.Rank ~ Height, data = tennis_training) |>  
            step_scale(all_predictors()) |>  
            step_center(all_predictors())

tennis_height_vfold <- vfold_cv(tennis_training, v = 5, strata = Current.Rank)

tennis_height_wkflw <- workflow() |>  add_recipe(tennis_height_recipe) |>  add_model(knn_spec)

#gridvals <- tibble(neighbors = seq(1, 66))

tennis_height_results <- tennis_height_wkflw |>
             tune_grid(resamples = tennis_height_vfold, grid = 65) |>
             collect_metrics() |>
             filter(.metric == "rmse")

tennis_height_min <- tennis_age_results |>  filter(mean == min(mean))
paste("Table 7")
tennis_height_min
'Table 7'
A tibble: 1 × 7
neighbors.metric.estimatormeannstd_err.config
<int><chr><chr><dbl><int><dbl><chr>
13rmsestandard131.565155.682429Preprocessor1_Model13
In [14]:
#re-train KNN regression model on the training data set
tennis_height_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 13) |>  
            set_engine("kknn") |>  
            set_mode("regression")

tennis_height_fit <- workflow() |>  
        add_recipe(tennis_height_recipe) |>  
        add_model(tennis_height_spec) |>  
        fit(data = tennis_training)

#predict rmpse with testing
tennis_height_rmspe <- tennis_height_fit |>  
            predict(tennis_testing) |>  
            bind_cols(tennis_testing) |>  
            metrics(truth = Current.Rank, estimate = .pred) |>  
            filter(.metric == 'rmse') |>
            select(.estimate) |>
            pull()

tennis_height_rmspe
103.680719639396

Finally, we will construct a model using both age and height.

In [15]:
set.seed(2022)

# Predictors: age and height

tennis_both_recipe <- recipe(Current.Rank ~ Age + Height, data = tennis_training) |>  
            step_scale(all_predictors()) |>
            step_center(all_predictors())

tennis_both_vfold <- vfold_cv(tennis_training, v = 5, strata = Current.Rank)

tennis_both_wkflw <- workflow() |>  add_recipe(tennis_both_recipe) |>  add_model(knn_spec)

#gridvals <- tibble(neighbors = seq(1, 66))

tennis_both_results <- tennis_both_wkflw |>
             tune_grid(resamples = tennis_both_vfold, grid = 65) |>
             collect_metrics() |>
             filter(.metric == "rmse")

tennis_both_min <- tennis_both_results |>  filter(mean == min(mean))
paste("Table 8")
tennis_both_min
'Table 8'
A tibble: 1 × 7
neighbors.metric.estimatormeannstd_err.config
<int><chr><chr><dbl><int><dbl><chr>
15rmsestandard128.110856.185088Preprocessor1_Model15
In [16]:
# re-train KNN regression model on the training data set
tennis_both_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 15) |>  
            set_engine("kknn") |>  
            set_mode("regression")

tennis_both_fit <- workflow() |>  
        add_recipe(tennis_both_recipe) |>  
        add_model(tennis_both_spec) |>  
        fit(data = tennis_training)

#predict rmpse with testing
tennis_both_rmspe <- tennis_both_fit |>  
            predict(tennis_testing) |>  
            bind_cols(tennis_testing) |>  
            metrics(truth = Current.Rank, estimate = .pred) |>  
            filter(.metric == 'rmse') |>
            select(.estimate) |>
            pull()
tennis_both_rmspe
95.0942986778158

We see that using only age as the predictor gives us the smallest root mean square prediction error (RMSPE). Therefore, we will model the predictions of current rank using only age.

Plotting our predictions:

In [17]:
#plot for rank vs age 
tennis_age_preds <- tennis_age_fit |>  
            predict(tennis_testing) |>  
            bind_cols(tennis_testing)

tennis_age_plot <- tennis_age_preds |>
    ggplot(aes(x = Age, y = Current.Rank)) +
    geom_point() +
    geom_line(data = tennis_age_preds,
              aes(x = Age, y = .pred),
              color = "blue") +
    labs(x = "Age (yr)", y = "Predicted rank") +
    ggtitle("Figure 9. Predicted Rank based on Age, K = 13") +
    theme(text = element_text(size = 15))

tennis_age_plot
In [18]:
#plot of rank vs height
tennis_height_preds <- tennis_height_fit |>  
            predict(tennis_testing) |>  
            bind_cols(tennis_testing)

tennis_height_plot <- tennis_height_preds |>
    ggplot(aes(x = Height, y = Current.Rank)) +
    geom_point() +
    geom_line(data = tennis_height_preds,
              aes(x = Height, y = .pred),
              color = "red") +
    labs(x = "Height (cm)", y = "Predicted rank") +
    ggtitle("Figure 10. Predicted Rank based on Height, K = 13") +
    theme(text = element_text(size = 15))

tennis_height_plot
In [19]:
tennis_both_preds <- tennis_both_fit |>
            predict(tennis_testing) |>
            bind_cols(tennis_testing)

surface <- select(tennis_both_preds, age = Age, height = Height, pred = .pred)
surface <- acast(surface, age ~ height, value.var = "pred", fun.aggregate = sum)

# bad surface predictions
tennis_both_plot <- tennis_both_preds |>
    plot_ly(x = ~Age, y = ~Height, z = ~Current.Rank, type = "scatter3d", mode = "markers", size = 1.5) |>
    add_surface(x = tennis_both_preds$Age,
                y = tennis_both_preds$Height,
                z = surface,
                colorbar = list(title = "Predicted Rank")) |>
    layout(scene = list(xaxis = list(title = "Age (yr)"),
                        yaxis = list(title = "Height (cm)"),
                        zaxis = list(title = "Rank")),
          title = "Figure 11. Predicted Rank based on Age and Height, K = 15")

tennis_both_plot

# 3d plot doesn't show up in .html, please refer to .ipynb
Warning message:
“'surface' objects don't have these attributes: 'mode'
Valid attributes include:
'_deprecated', 'autocolorscale', 'cauto', 'cmax', 'cmid', 'cmin', 'coloraxis', 'colorbar', 'colorscale', 'connectgaps', 'contours', 'customdata', 'customdatasrc', 'hidesurface', 'hoverinfo', 'hoverinfosrc', 'hoverlabel', 'hovertemplate', 'hovertemplatesrc', 'hovertext', 'hovertextsrc', 'ids', 'idssrc', 'legendgroup', 'legendgrouptitle', 'legendrank', 'lighting', 'lightposition', 'meta', 'metasrc', 'name', 'opacity', 'opacityscale', 'reversescale', 'scene', 'showlegend', 'showscale', 'stream', 'surfacecolor', 'surfacecolorsrc', 'text', 'textsrc', 'type', 'uid', 'uirevision', 'visible', 'x', 'xcalendar', 'xhoverformat', 'xsrc', 'y', 'ycalendar', 'yhoverformat', 'ysrc', 'z', 'zcalendar', 'zhoverformat', 'zsrc', 'key', 'set', 'frame', 'transforms', '_isNestedKey', '_isSimpleKey', '_isGraticule', '_bbox'
”
Warning message:
“'surface' objects don't have these attributes: 'mode'
Valid attributes include:
'_deprecated', 'autocolorscale', 'cauto', 'cmax', 'cmid', 'cmin', 'coloraxis', 'colorbar', 'colorscale', 'connectgaps', 'contours', 'customdata', 'customdatasrc', 'hidesurface', 'hoverinfo', 'hoverinfosrc', 'hoverlabel', 'hovertemplate', 'hovertemplatesrc', 'hovertext', 'hovertextsrc', 'ids', 'idssrc', 'legendgroup', 'legendgrouptitle', 'legendrank', 'lighting', 'lightposition', 'meta', 'metasrc', 'name', 'opacity', 'opacityscale', 'reversescale', 'scene', 'showlegend', 'showscale', 'stream', 'surfacecolor', 'surfacecolorsrc', 'text', 'textsrc', 'type', 'uid', 'uirevision', 'visible', 'x', 'xcalendar', 'xhoverformat', 'xsrc', 'y', 'ycalendar', 'yhoverformat', 'ysrc', 'z', 'zcalendar', 'zhoverformat', 'zsrc', 'key', 'set', 'frame', 'transforms', '_isNestedKey', '_isSimpleKey', '_isGraticule', '_bbox'
”

Unfortunately, due to a lack of coding expertise, we are unable achieve a satisfactory 3D plot to display predictions using both age and height. As such, we will plot the predictions against rank in two separate plots:

In [20]:
#plot of age vs height (model with both predictors)
tennis_both_preds_a <- tennis_both_fit |>  
            predict(tennis_testing) |>  
            bind_cols(tennis_testing)

tennis_both_plot_a <- tennis_both_preds_a |>
    ggplot(aes(x = Age, y = Current.Rank)) +
    geom_point() +
    geom_line(data = tennis_both_preds_a,
              aes(x = Age, y = .pred),
              color = "blue") +
    labs(x = "Age (yr)", y = "Predicted rank") +
    ggtitle("Figure 12. Rank vs Age (yr)") +
    theme(text = element_text(size = 15))

tennis_both_plot_a

#plot of rank vs height (model with both predictors)
tennis_both_preds_h <- tennis_both_fit |>  
            predict(tennis_testing) |>  
            bind_cols(tennis_testing)

tennis_both_plot_h <- tennis_both_preds_h |>
    ggplot(aes(x = Height, y = Current.Rank)) +
    geom_point() +
    geom_line(data = tennis_both_preds_h,
              aes(x = Height, y = .pred),
              color = "red") +
    labs(x = "Height (cm)", y = "Predicted rank") +
    ggtitle("Figure 13. Rank vs Height (cm)") +
    theme(text = element_text(size = 15))

tennis_both_plot_h

Discussion¶

By using the k-nearest neighbor regression algorithm, we were able to determine an estimated accuracy of how effectively our model predicts a player’s rank based on their physical attributes. We found that using only the age predictor gave us the lowest root mean square prediction error, 94.57, compared to the other predictor combinations, with only using height yielding an RMSPE of 103.68 and using both predictors giving 95.09. Hence, the age predictor provides the best accuracy and gives us the best predictions of the players’ current rank. Considering real-world practicality, when looking at the original data for age, there were only 0.2% of values missing, height had 56.8% missing, and current rank had 0.8% missing. This suggests that the age variable may be the easiest to determine when gathering player statistics. Following this, the fact that the age predictor alone gave us the best accuracy gives confidence that it is likely possible to predict a possible rank for mostly any player that comes into the sport as the age variable is almost guaranteed to be known. However, with our limited resources and statistical and coding expertise, we likely did not use the most effective method to achieve our purpose. Additionally, we used a single source as our dataset and it had many missing values, so improving our sample size and data quality would yield better results. Due to these limitations, our model does not predict rank with great accuracy. Perhaps future research may yield a more accurate model.

In our preliminary analysis, we found that older players were more likely to have lower ranks based on a negative correlation in age shown by the age vs current rank graph. This meant that, as age increased, players played better. We also expected that players who were average in height were more likely to have higher ranks based on the very weak normal distribution of height vs current rank. This showed that players whose height was not considered average played better. As both of these predictors seemed to somehow correlate with the current rank variable, we expected that a model using both predictors would provide the best accuracy, i.e., the lowest RMSPE. Instead, we found that using only age produced the best results.

Depending on the model’s prediction accuracy, the findings can help us predict how well a new player will perform based on the variables of age and height. This readily available information can be helpful for companies and coaches scouting for future talent in the sport of tennis. With a model presenting good accuracy, one would be able to look out for specific qualities that suggest a player is good enough to attain a high rank. This would benefit personal interests and the sport as a whole. Additionally, many potential studies may build upon our work in the future.

How influential is each attribute concerning a player’s rank? What are other possible predictors beyond our dataset that may influence a player’s current rank? Would we get similar results if we used a different dataset? Are there other variables that may classify someone’s rank, such as the country they are from? One can explore whether ethnic background or how many seasons a current player has played, will impact the KNN regression model. Furthermore, one can also analyze how a simple or multivariable linear regression model will compare in prediction accuracy to K-nearest neighbours regression. Extending the concept of the study, one can attempt to determine how such a model performs when applied to players of other sports, such as badminton or hockey.

References¶

Fernandez-Fernandez, J., Nakamura, F. Y., Moreno-Perez, V., Lopez-Valenciano, A., Del Coso, J., Gallo-Salazar, C., ... & Sanz-Rivas, D. (2019). Age and sex-related upper body performance differences in competitive young tennis players. PLoS One, 14(9), e0221761. https://doi.org/10.1371/journal.pone.0221761

McPherson, S. L., & Thomas, J. R. (1989). Relation of knowledge and performance in boys' tennis: Age and expertise. Journal of experimental child psychology, 48(2), 190-211. https://doi.org/10.1016/0022-0965(89)90002-7

Schulz, R., & Curnow, C. (1988). Peak performance and age among superathletes: Track and field, swimming, baseball, tennis, and golf. Journal of Gerontology, 43(5), 113-120. https://doi.org/10.1093/geronj/43.5.P113

Ultimate Tennis Statistics. (n.d.). Retrieved December 4, 2022, from https://www.ultimatetennisstatistics.com/